Trends on Stock Market using Exploratory Data Analysis and Machine Learning

Nifty50 Stock Data Analysis

The NIFTY 50 index is National Stock Exchange of India's benchmark broad based stock market index for the Indian equity market. NIFTY 50 stands for National Index Fifty, and represents the weighted average of 50 Indian company stocks in 17 sectors. It is one of the two main stock indices used in India, the other being the BSE Sensex. Source:Wikipedia

The NIFTY 50 is a diversified 50 stock index accounting for 13 sectors (as on 30 April 2021) of the economy. It is used for a variety of purposes such as benchmarking fund portfolios, index based derivatives and index funds.

  • NIFTY 50 is owned and managed by National Stock Exchange(NSE) Indices Limited.
  • The NIFTY 50 Index represents about 66.8% of the free float market capitalization( whose components are weighted according to the total market value of their outstanding shares)of the stocks listed on NSE.
  • The total traded value of NIFTY 50 index constituents for the last six months ending March 2019 is approximately 53.4% of the traded value of all stocks on the NSE.
  • Impact cost of the NIFTY 50 for a portfolio size of Rs.50 lakhs is 0.02% for the month March 2019.
  • NIFTY 50 is ideal for derivatives trading.

source:https://stableinvestor.com/2019/09/explained-nifty-indices-nifty50.html
From above Nifty 500 classification, we will concern only about Nifty 50.

In [51]:
#https://www.google.com/maps/search/nse+india+location+mumbai/@19.0601528,72.8597672,20.11z is from Google Map
nse_map = folium.Map(location=[19.060288332910762, 72.85984823112783], zoom_start=12, max_zoom=13)
tooltip = "National Stock Exchange of India"
folium.Marker(
    [19.06028008587612, 72.85987316084454], popup="<i>National Stock Exchange</i>", tooltip=tooltip
).add_to(nse_map)
nse_map
Out[51]:
Make this Notebook Trusted to load map: File -> Trust Notebook

This Project is to perform the Exploratory Data analysis on the Nifty 50 Stock data. In my analysis I will try to extract some useful standard results for Financial Investors or Companies. Here we use various libraries of Python for visualization of Data.

The Libraries which are used in Project are:

To install all required libraries, run the following Command:

pip install numpy pandas matplotlib seaborn plotly --upgrade

About Data visualization :

The Data visualization is the graphic representation of data.
It involves producing images that communicate relationships among the represented data to viewers.
Visualizing data is an essential part of data analysis and machine learning.
In this projedt, we'll use Python libraries like Matplotlib, Seaborn and Plotly to learn and apply some popular data visualization techniques.

Following Tasks are Implemented in the Project:

The following steps are made to accomplish the Project:







Now Let's Get Dive into the Project !!


Downloading and Importing the Dataset

This Project is on financial market domain and we have to download data appropriately. The Dataset is taken from www.kaggle.com (👈 Click to Download) and contains all price series for all 50 stock of Nifty index (India).

The data is the price history and trading volumes of the fifty stocks in the index NIFTY 50 from NSE (National Stock Exchange) India. All datasets are at a day-level with pricing and trading values split across .cvs files for each stock along with a metadata file with some macro-information about the stocks itself. The data spans from 27th November, 2007 to 30th October, 2020.

Acknowledgements

NSE India: https://www.nseindia.com/
Thanks to NSE for providing all the data publicly.

Let's begin by downloading the data, and listing the files within the dataset.

In [1]:
dataset_url = 'https://www.kaggle.com/rohanrao/nifty50-stock-market-data' 

The below statements are used to download the dataset if we are in Kaggle/Colab Notebook

import opendatasets as od

dataset_url= 'https://www.kaggle.com/rohanrao/nifty50-stock-market-data'

od.download(dataset_url)

But we require the data for our machine where we will be implementing in our localhost, therefore we prefer downloading the data externally, then extracting the data and placing in the root project directory.

Now Let's look our dataset

In [2]:
data_dir = './Stock_Market_Nifty_CSV/'  #we will be storing all the dataset files in a data_dir
In [122]:
import os
os.listdir(data_dir)
Out[122]:
['ADANIPORTS.csv',
 'ASIANPAINT.csv',
 'AXISBANK.csv',
 'BAJAJ-AUTO.csv',
 'BAJAJFINSV.csv',
 'BAJFINANCE.csv',
 'BHARTIARTL.csv',
 'BPCL.csv',
 'BRITANNIA.csv',
 'CIPLA.csv',
 'COALINDIA.csv',
 'DRREDDY.csv',
 'EICHERMOT.csv',
 'GAIL.csv',
 'GRASIM.csv',
 'HCLTECH.csv',
 'HDFC.csv',
 'HDFCBANK.csv',
 'HEROMOTOCO.csv',
 'HINDALCO.csv',
 'HINDUNILVR.csv',
 'ICICIBANK.csv',
 'INDUSINDBK.csv',
 'INFRATEL.csv',
 'INFY.csv',
 'IOC.csv',
 'ITC.csv',
 'JSWSTEEL.csv',
 'KOTAKBANK.csv',
 'LT.csv',
 'MARUTI.csv',
 'MM.csv',
 'NESTLEIND.csv',
 'NIFTY50_all.csv',
 'NTPC.csv',
 'ONGC.csv',
 'POWERGRID.csv',
 'RELIANCE.csv',
 'SBIN.csv',
 'SHREECEM.csv',
 'stock_metadata.csv',
 'SUNPHARMA.csv',
 'TATAMOTORS.csv',
 'TATASTEEL.csv',
 'TCS.csv',
 'TECHM.csv',
 'TITAN.csv',
 'ULTRACEMCO.csv',
 'UPL.csv',
 'VEDL.csv',
 'WIPRO.csv',
 'ZEEL.csv']

Importing all the libraries

In [53]:
# We will be including the libraries in this cell

import random
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas_datareader.data as web
import plotly.graph_objs as go
import plotly.express as px
import folium
from plotly.offline import init_notebook_mode,iplot
import cufflinks as cf
cf.go_offline()
cf.set_config_file(offline=False, world_readable=True)

import warnings
warnings.filterwarnings("ignore")

%matplotlib inline

Now we have successfully imported all the libraries which we will be using in our Project.

Data Preparation and Data Cleaning

Data Preprocessing

  • Data preprocessing is a data analysis technique that involves transforming raw data into an understandable format.
  • Real-world data is often incomplete, inconsistent, or lacking in certain behaviors or trends, and is likely to contain many errors.
  • Data preprocessing is a proven method of resolving such issues.

Here we perform various modes of displaying dataset which is in CSV format. First step is to load the data using pandas read_csv function. The data is stored in mutidimensional table called as dataframe.

Now we will Create a new dataframe that merge single price history for all 50 stocks with stock-metadata extra informations (like sectors, ISIN, etc) useful for analysis. Adding some extra columns with financial formula.

Stock_metadata.csv : This data file will tell us about the Nifty50 Datasets. This will explain about te data present in Nifty50 stock csv

Steps to follow :

  • Load the File using Pandas
  • Look at some information about data and the columns
  • Fix the missing and incorrect values it they are present
In [55]:
# loading the stock_metedata.csv in a variable and checking its type

stock_metadata_df = pd.read_csv('./Stock_Market_Nifty_CSV/stock_metadata.csv') 
type(stock_metadata_df)
Out[55]:
pandas.core.frame.DataFrame
In [163]:
stock_metadata_df
Out[163]:
Company Name Industry Symbol Series ISIN Code
0 Adani Ports and Special Economic Zone Ltd. SERVICES ADANIPORTS EQ INE742F01042
1 Asian Paints Ltd. CONSUMER GOODS ASIANPAINT EQ INE021A01026
2 Axis Bank Ltd. FINANCIAL SERVICES AXISBANK EQ INE238A01034
3 Bajaj Auto Ltd. AUTOMOBILE BAJAJ-AUTO EQ INE917I01010
4 Bajaj Finserv Ltd. FINANCIAL SERVICES BAJAJFINSV EQ INE918I01018
5 Bajaj Finance Ltd. FINANCIAL SERVICES BAJFINANCE EQ INE296A01024
6 Bharti Airtel Ltd. TELECOM BHARTIARTL EQ INE397D01024
7 Bharat Petroleum Corporation Ltd. ENERGY BPCL EQ INE029A01011
8 Britannia Industries Ltd. CONSUMER GOODS BRITANNIA EQ INE216A01030
9 Cipla Ltd. PHARMA CIPLA EQ INE059A01026
10 Coal India Ltd. METALS COALINDIA EQ INE522F01014
11 Dr. Reddy's Laboratories Ltd. PHARMA DRREDDY EQ INE089A01023
12 Eicher Motors Ltd. AUTOMOBILE EICHERMOT EQ INE066A01013
13 GAIL (India) Ltd. ENERGY GAIL EQ INE129A01019
14 Grasim Industries Ltd. CEMENT & CEMENT PRODUCTS GRASIM EQ INE047A01021
15 HCL Technologies Ltd. IT HCLTECH EQ INE860A01027
16 Housing Development Finance Corporation Ltd. FINANCIAL SERVICES HDFC EQ INE001A01036
17 HDFC Bank Ltd. FINANCIAL SERVICES HDFCBANK EQ INE040A01034
18 Hero MotoCorp Ltd. AUTOMOBILE HEROMOTOCO EQ INE158A01026
19 Hindalco Industries Ltd. METALS HINDALCO EQ INE038A01020
20 Hindustan Unilever Ltd. CONSUMER GOODS HINDUNILVR EQ INE030A01027
21 ICICI Bank Ltd. FINANCIAL SERVICES ICICIBANK EQ INE090A01021
22 IndusInd Bank Ltd. FINANCIAL SERVICES INDUSINDBK EQ INE095A01012
23 Bharti Infratel Ltd. TELECOM INFRATEL EQ INE121J01017
24 Infosys Ltd. IT INFY EQ INE009A01021
25 Indian Oil Corporation Ltd. ENERGY IOC EQ INE242A01010
26 ITC Ltd. CONSUMER GOODS ITC EQ INE154A01025
27 JSW Steel Ltd. METALS JSWSTEEL EQ INE019A01038
28 Kotak Mahindra Bank Ltd. FINANCIAL SERVICES KOTAKBANK EQ INE237A01028
29 Larsen & Toubro Ltd. CONSTRUCTION LT EQ INE018A01030
30 Mahindra & Mahindra Ltd. AUTOMOBILE M&M EQ INE101A01026
31 Maruti Suzuki India Ltd. AUTOMOBILE MARUTI EQ INE585B01010
32 Nestle India Ltd. CONSUMER GOODS NESTLEIND EQ INE239A01016
33 NTPC Ltd. ENERGY NTPC EQ INE733E01010
34 Oil & Natural Gas Corporation Ltd. ENERGY ONGC EQ INE213A01029
35 Power Grid Corporation of India Ltd. ENERGY POWERGRID EQ INE752E01010
36 Reliance Industries Ltd. ENERGY RELIANCE EQ INE002A01018
37 State Bank of India FINANCIAL SERVICES SBIN EQ INE062A01020
38 Shree Cement Ltd. CEMENT & CEMENT PRODUCTS SHREECEM EQ INE070A01015
39 Sun Pharmaceutical Industries Ltd. PHARMA SUNPHARMA EQ INE044A01036
40 Tata Motors Ltd. AUTOMOBILE TATAMOTORS EQ INE155A01022
41 Tata Steel Ltd. METALS TATASTEEL EQ INE081A01012
42 Tata Consultancy Services Ltd. IT TCS EQ INE467B01029
43 Tech Mahindra Ltd. IT TECHM EQ INE669C01036
44 Titan Company Ltd. CONSUMER GOODS TITAN EQ INE280A01028
45 UltraTech Cement Ltd. CEMENT & CEMENT PRODUCTS ULTRACEMCO EQ INE481G01011
46 UPL Ltd. FERTILISERS & PESTICIDES UPL EQ INE628A01036
47 Vedanta Ltd. METALS VEDL EQ INE205A01025
48 Wipro Ltd. IT WIPRO EQ INE075A01022
49 Zee Entertainment Enterprises Ltd. MEDIA & ENTERTAINMENT ZEEL EQ INE256A01028

Now we got the dataframe for the stock_metadata.csv. Here we have five columns, they are:

  • Company Name : Name of the Company
  • Industry : To which industry does it belongs
  • Symbol : Symbol used to indicate
  • Series: Type of security
  • ISIN Code : ISIN code of security (International Securities Identification Number)

Note : EQ - It stands for Equity.

In [177]:
stock_metadata_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50 entries, 0 to 49
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   Company Name  50 non-null     object
 1   Industry      50 non-null     object
 2   Symbol        50 non-null     object
 3   Series        50 non-null     object
 4   ISIN Code     50 non-null     object
dtypes: object(5)
memory usage: 2.1+ KB
In [56]:
# loading the nifty50_all.csv in a variable and checking its type
NIFTY50_all_df = pd.read_csv('./Stock_Market_Nifty_CSV/NIFTY50_all.csv')
NIFTY50_all_df
Out[56]:
Date Symbol Series Prev Close Open High Low Last Close VWAP Volume Turnover Trades Deliverable Volume %Deliverble
0 2007-11-27 MUNDRAPORT EQ 440.00 770.00 1050.00 770.00 959.00 962.90 984.72 27294366 2.687719e+15 NaN 9859619.0 0.3612
1 2007-11-28 MUNDRAPORT EQ 962.90 984.00 990.00 874.00 885.00 893.90 941.38 4581338 4.312765e+14 NaN 1453278.0 0.3172
2 2007-11-29 MUNDRAPORT EQ 893.90 909.00 914.75 841.00 887.00 884.20 888.09 5124121 4.550658e+14 NaN 1069678.0 0.2088
3 2007-11-30 MUNDRAPORT EQ 884.20 890.00 958.00 890.00 929.00 921.55 929.17 4609762 4.283257e+14 NaN 1260913.0 0.2735
4 2007-12-03 MUNDRAPORT EQ 921.55 939.75 995.00 922.00 980.00 969.30 965.65 2977470 2.875200e+14 NaN 816123.0 0.2741
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
235187 2021-04-26 ZEEL EQ 188.00 190.60 191.10 185.10 186.70 186.40 187.35 8542755 1.600451e+14 52374.0 2340188.0 0.2739
235188 2021-04-27 ZEEL EQ 186.40 188.00 192.95 186.80 188.80 188.15 189.41 14247767 2.698636e+14 73673.0 5425957.0 0.3808
235189 2021-04-28 ZEEL EQ 188.15 188.80 190.60 187.10 188.95 189.10 188.85 8429439 1.591917e+14 44056.0 2413974.0 0.2864
235190 2021-04-29 ZEEL EQ 189.10 190.80 191.65 186.00 186.60 186.55 187.44 9483009 1.777471e+14 60932.0 2744472.0 0.2894
235191 2021-04-30 ZEEL EQ 186.55 185.30 190.95 183.65 185.00 185.60 187.53 11435285 2.144440e+14 62607.0 3323909.0 0.2907

235192 rows × 15 columns

Now the description of column names of above Dataframe

  • Date - Date ranges frpm 2007 t0 2021
  • Symbol - Unique Company symbol under Nifty50
  • Series -
  • Prev Close -
  • Open and Close - Indicate the opening and closing price of the stocks on a particular day.
  • High and Low - provide the highest and the lowest price for the stock on a particular day, respectively.
  • Last -
  • VWAP - Volume-weighted average price
  • Volume - This column tells us the total volume of stocks traded on a particular day.
  • Turnover - refers to the total value of stocks traded during a specific period of time. The time period may be annually, quarterly, monthly or daily
  • Trades -
  • Deliverable Volume - Total Volume of stock that can be deliverable
  • %Deliverable -

Source: Investopedia.com

In [115]:
nse_india_df = NIFTY50_all_df.merge(stock_metadata_df, on="Symbol")
nse_india_df

# nse_india_df = NIFTY50_all_df + stock_metadata_df (combined for the sake of new column names)
Out[115]:
Date Symbol Series_x Prev Close Open High Low Last Close VWAP Volume Turnover Trades Deliverable Volume %Deliverble Company Name Industry Series_y ISIN Code
0 2012-01-17 ADANIPORTS EQ 135.50 137.1 141.00 135.00 140.10 140.00 138.13 1636196 2.260074e+13 18374.0 1004327.0 0.6138 Adani Ports and Special Economic Zone Ltd. SERVICES EQ INE742F01042
1 2012-01-18 ADANIPORTS EQ 140.00 142.0 143.80 138.70 143.00 141.70 141.25 890591 1.257986e+13 15615.0 404925.0 0.4547 Adani Ports and Special Economic Zone Ltd. SERVICES EQ INE742F01042
2 2012-01-19 ADANIPORTS EQ 141.70 144.0 150.55 143.15 149.50 149.40 146.72 1456077 2.136382e+13 31299.0 721545.0 0.4955 Adani Ports and Special Economic Zone Ltd. SERVICES EQ INE742F01042
3 2012-01-20 ADANIPORTS EQ 149.40 151.9 157.60 150.25 155.40 155.40 153.76 1634070 2.512583e+13 23335.0 861145.0 0.5270 Adani Ports and Special Economic Zone Ltd. SERVICES EQ INE742F01042
4 2012-01-23 ADANIPORTS EQ 155.40 155.4 155.40 145.10 146.40 146.75 149.54 1657609 2.478768e+13 12400.0 820653.0 0.4951 Adani Ports and Special Economic Zone Ltd. SERVICES EQ INE742F01042
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
207845 2021-04-26 ZEEL EQ 188.00 190.6 191.10 185.10 186.70 186.40 187.35 8542755 1.600451e+14 52374.0 2340188.0 0.2739 Zee Entertainment Enterprises Ltd. MEDIA & ENTERTAINMENT EQ INE256A01028
207846 2021-04-27 ZEEL EQ 186.40 188.0 192.95 186.80 188.80 188.15 189.41 14247767 2.698636e+14 73673.0 5425957.0 0.3808 Zee Entertainment Enterprises Ltd. MEDIA & ENTERTAINMENT EQ INE256A01028
207847 2021-04-28 ZEEL EQ 188.15 188.8 190.60 187.10 188.95 189.10 188.85 8429439 1.591917e+14 44056.0 2413974.0 0.2864 Zee Entertainment Enterprises Ltd. MEDIA & ENTERTAINMENT EQ INE256A01028
207848 2021-04-29 ZEEL EQ 189.10 190.8 191.65 186.00 186.60 186.55 187.44 9483009 1.777471e+14 60932.0 2744472.0 0.2894 Zee Entertainment Enterprises Ltd. MEDIA & ENTERTAINMENT EQ INE256A01028
207849 2021-04-30 ZEEL EQ 186.55 185.3 190.95 183.65 185.00 185.60 187.53 11435285 2.144440e+14 62607.0 3323909.0 0.2907 Zee Entertainment Enterprises Ltd. MEDIA & ENTERTAINMENT EQ INE256A01028

207850 rows × 19 columns

In [69]:
# print all the column names of the nse_india_df

for col in nse_india_df.columns:
    print(col)

#Count the total number of columns in the merged dataframe
len(nse_india_df.columns)
Date
Symbol
Series_x
Prev Close
Open
High
Low
Last
Close
VWAP
Volume
Turnover
Trades
Deliverable Volume
%Deliverble
Company Name
Industry
Series_y
ISIN Code
Out[69]:
19
In [67]:
nse_india_df.shape
Out[67]:
(207850, 19)
In [70]:
nse_india_df.isnull()
Out[70]:
Date Symbol Series_x Prev Close Open High Low Last Close VWAP Volume Turnover Trades Deliverable Volume %Deliverble Company Name Industry Series_y ISIN Code
0 False False False False False False False False False False False False False False False False False False False
1 False False False False False False False False False False False False False False False False False False False
2 False False False False False False False False False False False False False False False False False False False
3 False False False False False False False False False False False False False False False False False False False
4 False False False False False False False False False False False False False False False False False False False
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
207845 False False False False False False False False False False False False False False False False False False False
207846 False False False False False False False False False False False False False False False False False False False
207847 False False False False False False False False False False False False False False False False False False False
207848 False False False False False False False False False False False False False False False False False False False
207849 False False False False False False False False False False False False False False False False False False False

207850 rows × 19 columns

In [ ]:
 
In [95]:
nse_india_df.isnull().sum()
Out[95]:
Date                      0
Symbol                    0
Series_x                  0
Prev Close                0
Open                      0
High                      0
Low                       0
Last                      0
Close                     0
VWAP                      0
Volume                    0
Turnover                  0
Trades                89303
Deliverable Volume    10807
%Deliverble           10807
Company Name              0
Industry                  0
Series_y                  0
ISIN Code                 0
dtype: int64
In [73]:
nse_india_df.fillna(nse_india_df.mean(), inplace=True)
nse_india_df.isnull().sum()

# This operation will take approx 25 minutes to fill all the empty cells
Out[73]:
Date                  0
Symbol                0
Series_x              0
Prev Close            0
Open                  0
High                  0
Low                   0
Last                  0
Close                 0
VWAP                  0
Volume                0
Turnover              0
Trades                0
Deliverable Volume    0
%Deliverble           0
Company Name          0
Industry              0
Series_y              0
ISIN Code             0
dtype: int64
In [118]:
nse_india_df.Close.max()
Out[118]:
32861.95
In [117]:
nse_india_df.Close.min()
Out[117]:
9.15

Now before proceding further with Data Analysis, we will examine with the present dataframe by simply plotting using Matplotlib

In [156]:
# here we try to reset the index based on 'Close' column accordingly in the nse_india_df dataframe
# Then store in nse_india_viz variable and visualize with plot.

nse_india_viz = nse_india_df.reset_index()['Close']
plt.figure(figsize=(15,5))
plt.plot(nse_india_viz)
plt.title('Whole Dataset plot view of Month end closure value')
plt.xlabel('Maximum EOM value')
plt.ylabel('Total number of Rows')

plt.show()


We have successfully Plotted the First visualization using Matplotlib. Using Entire companies dataset of 207850 rows × 19 columns ......

⚠️ ⚠️ ⚠️ But Here their are few Disadvantages that causes if we move further ⚠️ ⚠️ ⚠️

  • This will lead to the Dense Plots
  • Incomplete fitting of rows
  • Irregular patterns or insights of data
  • Key assumptions will be lost
  • Causing False impression.

To overcome these, we will use, One single Company insteard of Entire companies, To dive further into the Analysis

The List of Symbols of all Companies of Nifty50

A B C D E
ADANIPORTS ASIANPAINT AXISBANK BAJAJ-AUTO BAJAJFINSV
BAJFINANCE BHARTIARTL BPCL BRITANNIA CIPLA
COALINDIA DRREDDY EICHERMOT GAIL GRASIM
HCLTECH HDFC HDFCBANK HEROMOTOCO HINDALCO
HINDUNILVR ICICIBANK INDUSINDBK INFRATEL INFY
IOC ITC JSWSTEEL KOTAKBANK LT
M&M MARUTI NESTLEIND NTPC ONGC
POWERGRID RELIANCE SBIN SHREECEM SUNPHARMA
TATAMOTORS TATASTEEL TCS TECHM TITAN
ULTRACEMCO UPL VEDL WIPRO ZEEL

Here their are Symbols of all Companies which are in Nifty50. We will be Getting the detailed anslysis of each company individually.

In [57]:
c_list = ['ADANIPORTS','ASIANPAINT','AXISBANK','BAJAJ-AUTO','BAJAJFINSV','BAJFINANCE','BHARTIARTL','BPCL','BRITANNIA','CIPLA',
        'COALINDIA','DRREDDY','EICHERMOT','GAIL','GRASIM','HCLTECH','HDFC','HDFCBANK','HEROMOTOCO','HINDALCO','HINDUNILVR',
        'ICICIBANK','INDUSINDBK','INFRATEL','INFY','IOC','ITC','JSWSTEEL','KOTAKBANK','LT','M&M','MARUTI','NESTLEIND','NTPC',
        'ONGC','POWERGRID','RELIANCE','SBIN','SHREECEM','SUNPHARMA','TATAMOTORS','TATASTEEL','TCS','TECHM','TITAN','ULTRACEMCO',
        'UPL','VEDL','WIPRO','ZEEL']

c_name = input('Enter the Company name to perform the Analysis : ')
if c_name or c_name.swapcase() in c_list:
    comp = c_name.upper()
else:
    print('The Given name is not Present in Nifty50 Data')
Enter the Company name to perform the Analysis : hdfc
In [58]:
comp_var = './Stock_Market_Nifty_CSV/'+comp+'.csv'
comp_var
Out[58]:
'./Stock_Market_Nifty_CSV/HDFC.csv'
In [59]:
comp_df = pd.read_csv(comp_var)
comp_df
Out[59]:
Date Symbol Series Prev Close Open High Low Last Close VWAP Volume Turnover Trades Deliverable Volume %Deliverble
0 2000-01-03 HDFC EQ 271.75 293.50 293.50 293.50 293.50 293.50 293.50 22744 6.675364e+11 NaN NaN NaN
1 2000-01-04 HDFC EQ 293.50 317.00 317.00 297.00 304.00 304.05 303.62 255251 7.749972e+12 NaN NaN NaN
2 2000-01-05 HDFC EQ 304.05 290.00 303.90 285.00 295.00 292.80 294.53 269087 7.925368e+12 NaN NaN NaN
3 2000-01-06 HDFC EQ 292.80 301.00 314.00 295.00 296.00 296.45 300.14 305916 9.181669e+12 NaN NaN NaN
4 2000-01-07 HDFC EQ 296.45 290.00 296.35 281.00 287.10 286.55 288.80 197039 5.690480e+12 NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
5301 2021-04-26 HDFC EQ 2497.35 2500.00 2534.10 2483.20 2502.00 2509.80 2508.07 3916088 9.821805e+14 121028.0 2440395.0 0.6232
5302 2021-04-27 HDFC EQ 2509.80 2494.15 2526.80 2486.25 2514.00 2518.40 2509.18 2040799 5.120730e+14 102250.0 1040749.0 0.5100
5303 2021-04-28 HDFC EQ 2518.40 2516.10 2609.00 2508.30 2575.00 2577.00 2574.21 3407461 8.771527e+14 117425.0 1815110.0 0.5327
5304 2021-04-29 HDFC EQ 2577.00 2590.90 2628.00 2533.00 2539.70 2538.85 2569.65 3005468 7.722995e+14 132826.0 1472924.0 0.4901
5305 2021-04-30 HDFC EQ 2538.85 2503.10 2525.00 2411.10 2433.25 2420.10 2445.94 6024595 1.473581e+15 224454.0 3839105.0 0.6372

5306 rows × 15 columns

Now Lets quickly perform a simple Data preprocessing for the Selected Company

In [412]:
print('The type of this dataframe is : ',type(comp_df))
The type of this dataframe is :  <class 'pandas.core.frame.DataFrame'>
In [94]:
print('Rows and Columns in the Dataframe :')
comp_df
Rows and Columns in the Dataframe :
Out[94]:
Date Symbol Series Prev Close Open High Low Last Close VWAP Volume Turnover Trades Deliverable Volume %Deliverble
0 2000-01-03 ONGC EQ 207.70 205.00 214.4 205.00 214.00 213.45 209.91 9600 2.015090e+11 NaN NaN NaN
1 2000-01-04 ONGC EQ 213.45 201.50 209.7 201.00 209.00 206.55 206.26 17900 3.692030e+11 NaN NaN NaN
2 2000-01-05 ONGC EQ 206.55 201.00 223.1 200.50 223.10 223.10 215.46 27000 5.817490e+11 NaN NaN NaN
3 2000-01-06 ONGC EQ 223.10 234.00 234.9 217.60 217.60 219.30 223.42 35600 7.953680e+11 NaN NaN NaN
4 2000-01-07 ONGC EQ 219.30 223.00 223.0 213.00 215.70 215.75 215.19 11400 2.453145e+11 NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
5301 2021-04-26 ONGC EQ 102.40 105.25 105.7 102.50 102.80 102.80 103.86 11797791 1.225355e+14 63132.0 3024740.0 0.2564
5302 2021-04-27 ONGC EQ 102.80 102.80 104.0 102.80 103.30 103.20 103.42 8886250 9.189768e+13 46923.0 3802836.0 0.4279
5303 2021-04-28 ONGC EQ 103.20 103.75 104.4 103.30 104.05 103.90 103.90 6887787 7.156081e+13 33659.0 2533669.0 0.3678
5304 2021-04-29 ONGC EQ 103.90 104.90 105.9 103.55 104.30 104.05 104.39 14990087 1.564876e+14 54416.0 4613745.0 0.3078
5305 2021-04-30 ONGC EQ 104.05 104.15 112.7 103.30 108.20 108.15 109.79 81358264 8.932632e+14 248419.0 15306987.0 0.1881

5306 rows × 15 columns

In [95]:
print('Number of Rows and Columns in the : ')
comp_df.shape
Number of Rows and Columns in the : 
Out[95]:
(5306, 15)
In [96]:
print('Print the Dimention of our dataframe : ')
comp_df.ndim
Print the Dimention of our dataframe : 
Out[96]:
2
In [97]:
comp_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5306 entries, 0 to 5305
Data columns (total 15 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Date                5306 non-null   object 
 1   Symbol              5306 non-null   object 
 2   Series              5306 non-null   object 
 3   Prev Close          5306 non-null   float64
 4   Open                5306 non-null   float64
 5   High                5306 non-null   float64
 6   Low                 5306 non-null   float64
 7   Last                5306 non-null   float64
 8   Close               5306 non-null   float64
 9   VWAP                5306 non-null   float64
 10  Volume              5306 non-null   int64  
 11  Turnover            5306 non-null   float64
 12  Trades              2456 non-null   float64
 13  Deliverable Volume  4797 non-null   float64
 14  %Deliverble         4797 non-null   float64
dtypes: float64(11), int64(1), object(3)
memory usage: 621.9+ KB
In [418]:
#Descriptive Staistics

print('The tabular Description of the dataframe')
comp_df.describe()
The tabular Description of the dataframe
Out[418]:
Prev Close Open High Low Last Close VWAP Volume Turnover Trades Deliverable Volume %Deliverble
count 5306.000000 5306.000000 5306.000000 5306.000000 5306.000000 5306.000000 5306.000000 5.306000e+03 5.306000e+03 5306.000000 5.306000e+03 5306.000000
mean 491.156954 491.914832 500.182152 482.758509 491.102657 491.138193 491.554470 4.627373e+06 1.192360e+14 62667.979642 2.387201e+06 0.513413
std 385.181628 385.689433 392.444581 378.243050 385.088030 385.197861 385.476883 8.938633e+06 1.264207e+14 25615.119831 3.587651e+06 0.141105
min 60.000000 59.400000 63.000000 50.000000 60.200000 60.000000 61.000000 5.000000e+01 8.012500e+08 864.000000 9.970000e+02 0.082900
25% 177.462500 178.500000 180.712500 175.512500 177.475000 177.450000 177.687500 8.181695e+05 5.600668e+13 56789.250000 5.865435e+05 0.433600
50% 305.400000 306.000000 310.000000 301.725000 305.500000 305.400000 306.135000 2.082784e+06 9.902516e+13 62667.979642 1.579048e+06 0.513413
75% 845.125000 844.900000 859.300000 830.000000 844.775000 845.125000 845.655000 4.839795e+06 1.548728e+14 62667.979642 2.747347e+06 0.609500
max 1484.200000 1480.550000 1513.900000 1443.150000 1490.100000 1484.200000 1482.260000 1.785935e+08 4.348316e+15 364976.000000 9.587384e+07 1.000000
In [99]:
print('The Correlation among all the rows and columns')
comp_df.corr()
The Correlation among all the rows and columns
Out[99]:
Prev Close Open High Low Last Close VWAP Volume Turnover Trades Deliverable Volume %Deliverble
Prev Close 1.000000 0.999137 0.998944 0.998737 0.998686 0.998710 0.998946 -0.314333 0.131939 -0.156357 -0.399584 0.073291
Open 0.999137 1.000000 0.999590 0.999419 0.999236 0.999266 0.999528 -0.314461 0.131807 -0.154275 -0.399930 0.072725
High 0.998944 0.999590 1.000000 0.999259 0.999635 0.999662 0.999800 -0.313025 0.137109 -0.142449 -0.398597 0.068740
Low 0.998737 0.999419 0.999259 1.000000 0.999561 0.999588 0.999684 -0.315354 0.127451 -0.162170 -0.401144 0.075034
Last 0.998686 0.999236 0.999635 0.999561 1.000000 0.999974 0.999865 -0.314191 0.132536 -0.152247 -0.399927 0.071477
Close 0.998710 0.999266 0.999662 0.999588 0.999974 1.000000 0.999901 -0.314109 0.132519 -0.151943 -0.399876 0.071407
VWAP 0.998946 0.999528 0.999800 0.999684 0.999865 0.999901 1.000000 -0.314151 0.132823 -0.152275 -0.399931 0.071237
Volume -0.314333 -0.314461 -0.313025 -0.315354 -0.314191 -0.314109 -0.314151 1.000000 0.640352 0.745926 0.902852 -0.172920
Turnover 0.131939 0.131807 0.137109 0.127451 0.132536 0.132519 0.132823 0.640352 1.000000 0.812309 0.652187 0.003434
Trades -0.156357 -0.154275 -0.142449 -0.162170 -0.152247 -0.151943 -0.152275 0.745926 0.812309 1.000000 0.636332 -0.270596
Deliverable Volume -0.399584 -0.399930 -0.398597 -0.401144 -0.399927 -0.399876 -0.399931 0.902852 0.652187 0.636332 1.000000 0.032062
%Deliverble 0.073291 0.072725 0.068740 0.075034 0.071477 0.071407 0.071237 -0.172920 0.003434 -0.270596 0.032062 1.000000
In [100]:
print('Print top five rows dataframe : ')
comp_df.head(5)
Print top five rows dataframe : 
Out[100]:
Date Symbol Series Prev Close Open High Low Last Close VWAP Volume Turnover Trades Deliverable Volume %Deliverble
0 2000-01-03 ONGC EQ 207.70 205.0 214.4 205.0 214.0 213.45 209.91 9600 2.015090e+11 NaN NaN NaN
1 2000-01-04 ONGC EQ 213.45 201.5 209.7 201.0 209.0 206.55 206.26 17900 3.692030e+11 NaN NaN NaN
2 2000-01-05 ONGC EQ 206.55 201.0 223.1 200.5 223.1 223.10 215.46 27000 5.817490e+11 NaN NaN NaN
3 2000-01-06 ONGC EQ 223.10 234.0 234.9 217.6 217.6 219.30 223.42 35600 7.953680e+11 NaN NaN NaN
4 2000-01-07 ONGC EQ 219.30 223.0 223.0 213.0 215.7 215.75 215.19 11400 2.453145e+11 NaN NaN NaN
In [101]:
print('Print bottom five rows of dataframe : ')
comp_df.tail(5)
Print bottom five rows of dataframe : 
Out[101]:
Date Symbol Series Prev Close Open High Low Last Close VWAP Volume Turnover Trades Deliverable Volume %Deliverble
5301 2021-04-26 ONGC EQ 102.40 105.25 105.7 102.50 102.80 102.80 103.86 11797791 1.225355e+14 63132.0 3024740.0 0.2564
5302 2021-04-27 ONGC EQ 102.80 102.80 104.0 102.80 103.30 103.20 103.42 8886250 9.189768e+13 46923.0 3802836.0 0.4279
5303 2021-04-28 ONGC EQ 103.20 103.75 104.4 103.30 104.05 103.90 103.90 6887787 7.156081e+13 33659.0 2533669.0 0.3678
5304 2021-04-29 ONGC EQ 103.90 104.90 105.9 103.55 104.30 104.05 104.39 14990087 1.564876e+14 54416.0 4613745.0 0.3078
5305 2021-04-30 ONGC EQ 104.05 104.15 112.7 103.30 108.20 108.15 109.79 81358264 8.932632e+14 248419.0 15306987.0 0.1881

Handling Null values

In [102]:
comp_df.isnull().sum()
Out[102]:
Date                     0
Symbol                   0
Series                   0
Prev Close               0
Open                     0
High                     0
Low                      0
Last                     0
Close                    0
VWAP                     0
Volume                   0
Turnover                 0
Trades                2850
Deliverable Volume     509
%Deliverble            509
dtype: int64
In [103]:
numerics = ['int32', 'int64', 'float32', 'float64']
xp = comp_df.select_dtypes(include = numerics)

print('Total number of columns : ',len(comp_df.columns))
print('Numeric type columns : ',len(xp.columns))
print('Non-numeric/Object type columns : ',len(comp_df.columns) - len(xp.columns))
Total number of columns :  15
Numeric type columns :  12
Non-numeric/Object type columns :  3
In [104]:
comp_df.fillna(comp_df.mean(), inplace=True)
comp_df.isnull().sum()

#comp_df[col_name].fillna('char', inplace=True)
#comp_df.isnull().sum()
Out[104]:
Date                  0
Symbol                0
Series                0
Prev Close            0
Open                  0
High                  0
Low                   0
Last                  0
Close                 0
VWAP                  0
Volume                0
Turnover              0
Trades                0
Deliverable Volume    0
%Deliverble           0
dtype: int64
In [ ]:
 
In [ ]:
 
In [106]:
a1 = comp_df.Date.max()
a2 = comp_df.Date.min()
a3 = comp_df.High.max()
a4 = comp_df.Low.min()
a5 = comp_df.Volume.mean()
a6 = comp_df.Turnover.max()
a7 = comp_df.Series.unique()[0]
a8 = comp_df.Last.max()

d1 = comp_df[comp_df['High'] == comp_df.High.max()]['Date']
d2 = comp_df[comp_df['Low'] == comp_df.Low.min()]['Date']
In [384]:
#### print('Whats the Stating date of '+comp+' :',a1)
print('\nWhats the Last date in the data of '+comp+' :',a2)
print('On this Date ',str(d1)[5:15])
print('\nHow much did the Highest recorded value of stock has gained of '+comp+' :',a3)
print('On this Date ',str(d2)[8:18])
print('\nHow much did the Least recorded value of stock has been of '+comp+' :',a4)
print('\nOf overall period, whats the average of Volume of '+comp+' :',a5)
print('\nThe total value of stocks traded during a specific period of time of '+comp+' :',a6)
print('\nThe Security type offered for '+comp+' company by Nifty50 is :',a7)
print('\nThe maximum profit of the New or recent Close of stock in '+comp+' :',a8)
Whats the Last date in the data of ONGC : 2000-01-03
On this Date     2006-05

How much did the Highest recorded value of stock has gained of ONGC : 1513.9
On this Date  2020-03-13

How much did the Least recorded value of stock has been of ONGC : 50.0

Of overall period, whats the average of Volume of ONGC : 4627373.344327177

The total value of stocks traded during a specific period of time of ONGC : 4348315656105001.0

The Security type offered for ONGC company by Nifty50 is : EQ

The maximum profit of the New or recent Close of stock in ONGC : 1490.1
In [ ]:
 

This ends our preprocessing task, moving to the Actual data analysis and visualization part

Exploratory Analysis and Visualization

What is Exploratory Data Analysis(EDA)?

If we want to explain EDA in simple terms, trying to understand the given data much better, so that we can make some sense out of it.

According to Wikipedia:

In statistics, exploratory data analysis is an approach to analyzing data sets to summarize their main characteristics, often with visual methods. A statistical model can be used or not, but primarily EDA is for seeing what the data can tell us beyond the formal modeling or hypothesis testing task.

  • EDA in Python uses data visualization to draw meaningful patterns and insights.
  • It also involves the preparation of data sets for analysis by removing irregularities in the data.
  • Based on the results of EDA, companies also make business decisions, which can help them to lose in future later.

Here we mainly concentrate on data visualization, in Python their are mainly 3 libraries are their Matplotlib.pyplot, Seaborn and one more is Plotly. Using these we wil perform the EDA. Navigation into data: starting to plot a simple series of closing price (EOM) for a single stock, all others analysis regarding group of data.

In [22]:
comp_df.columns
Out[22]:
Index(['Date', 'Symbol', 'Series', 'Prev Close', 'Open', 'High', 'Low', 'Last',
       'Close', 'VWAP', 'Volume', 'Turnover', 'Trades', 'Deliverable Volume',
       '%Deliverble'],
      dtype='object')

For doing Analysis, we will be selecting the specific columns out of the 15 columns and Generating the insights/plots.

  • Date
  • High and Low
  • Open and Close
  • Volume

Analysis 1 :

Plotting price EOM (end of month) of single stock Company limited from the selected range of end of month 'Close' value.

In [283]:
plt.figure(figsize=(15,5))
an_1 = comp_df['Close'][1500:2600] #ranges from 0 to 5306
plt.title('Plot view of closure value of selected company')
plt.xlabel('Maximum EOM Close value')
plt.ylabel('Total number of Rows')
plt.plot(an_1)
Out[283]:
[<matplotlib.lines.Line2D at 0x1b5bda57c10>]
In [388]:
comp_df['Date'] =  pd.to_datetime(comp_df['Date'])
show_month = comp_df.set_index('Date')
show_month['Close'].loc['2012'].plot(figsize=(15,7), color='Red', linestyle='--')
plt.title("Stock Close Price")
plt.ylabel('Close Value')
plt.show()
In [276]:
comp_df['Open'][:2600].plot(legend=True,figsize=(13,6), grid=True)
plt.legend(bbox_to_anchor=(1.1, 0.75))
plt.ylabel('Close value range')
plt.xlabel('Year range from 2000 to 2014')
plt.title("Stock Close Price")
plt.autoscale()
plt.show()
In [385]:
fig = px.line(comp_df[5000:], x='Date', y='Volume', title='Volume of Stock on Perticular Date')
fig
In [ ]:
 

Analysis 2 :

In [333]:
fig = px.line(comp_df, x='Date', y='Last', title=' ')

fig.update_xaxes(
    rangeslider_visible=True,
    rangeselector=dict(
        buttons=list([
            dict(count=1, label="1m", step="month", stepmode="backward"),
            dict(count=6, label="6m", step="month", stepmode="backward"),
            dict(count=1, label="YTD", step="year", stepmode="todate"),
            dict(count=1, label="1y", step="year", stepmode="backward"),
            dict(step="all")
        ])
    )
)

fig.update_layout(plot_bgcolor='rgb(250, 242, 242)',
    title='NIFTY_50 : Major single day falls in '+comp+' from 2000 onwards',
    yaxis_title='NIFTY 50 Stock',
    shapes = [dict(x0='2020-03-23', x1='2020-03-23', y0=0, y1=1, xref='x', yref='paper', line_width=2,opacity=0.3,line_color='red',editable=False),
             dict(x0='2019-09-3', x1='2019-09-3', y0=0, y1=1, xref='x', yref='paper',line_width=3,opacity=0.3,line_color='red'),
             dict(x0='2020-02-1', x1='2020-02-1', y0=0, y1=1, xref='x', yref='paper',line_width=3,opacity=0.3,line_color='red'),
             dict(x0='2020-03-12', x1='2020-03-12', y0=0, y1=1, xref='x', yref='paper',line_width=3,opacity=0.3,line_color='red')],
    annotations=[dict(x='2020-03-23', y=0.5, xref='x', yref='paper',
                    showarrow=False, xanchor='left', text='Lockdown Phase-1 announced'),
                dict(x='2019-09-3', y=0.05, xref='x', yref='paper',
                    showarrow=False, xanchor='left', text='Multiple PSU Bank Merger Announcements'),
                dict(x='2020-02-1', y=0.5, xref='x', yref='paper',
                    showarrow=False, xanchor='right', text='Union Budget,coronavirus pandemic'),
                dict(x='2020-03-12', y=0.3, xref='x', yref='paper',
                    showarrow=False, xanchor='right', text='Coronavirus declared Pandemic by WHO')]
)
fig.show()
In [346]:
fig = go.Figure()

fig.add_trace(go.Scatter(
         x=comp_df['Date'],
         y=comp_df['High'][500:1300],
         name='High Price',
    line=dict(color='blue'),
    opacity=0.8))

fig.add_trace(go.Scatter(
         x=comp_df['Date'],
         y=comp_df['Low'][500:1300],
         name='Low Price',
    line=dict(color='orange'),
    opacity=0.8))
    
fig.update_layout(title_text='Comparision of High vs Low of '+comp+' Industry of Nifty50',
                  plot_bgcolor='rgb(250, 242, 242)',yaxis_title='Low and High Values',xaxis_title='Years')

fig.show()

Analysis 3 :

In [347]:
data= comp_df[['Volume','High']][1200:1400]
type(data)
# draw jointplot with
# hex kind
sns.jointplot(x = "Volume", y = "High",kind = "kde", data = data, color='Red')
# show the plot

plt.show()
In [348]:
data= comp_df[['Volume','Low']][1000:1400]
type(data)
# draw jointplot with
# hex kind
sns.jointplot(x = "Volume", y = "Low",kind = "scatter", data = data, color='Navy')
# show the plot
plt.show()
In [349]:
data= comp_df[['Volume','Turnover']][1300:1400]
type(data)
# draw jointplot with
# hex kind
sns.jointplot(x = "Volume", y = "Turnover",kind = "hex", data = data, color='Green')
# show the plot
plt.show()
In [400]:
# Comparing Google to itself shows a perfectly linear relationship

sns.jointplot(comp_df[1300:1400].High,comp_df[1300:1400].Close,kind='scatter',color='seagreen')
Out[400]:
<seaborn.axisgrid.JointGrid at 0x1b5bf11cfd0>
In [ ]:
sns.pairplot(comp_df[1200:1300].loc[0:,['High','Low','Open','Close','Volume']])
plt.show()
In [ ]:
 
In [485]:
import random
from itertools import count
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.animation import FuncAnimation

plt.style.use('fivethirtyeight')

x_values = []
y_values = []

index = count()


def animate():
    data = pd.read_csv('./Stock_Market_Nifty_CSV/ITC.csv')
    x_values.append(data['Open'])
    y_values = data['Close']
    plt.cla()
    
    plt.plot(x_values, y_values)
    plt.xlabel('Time')
    plt.ylabel('Price')
    plt.title('ITC')
    plt.gcf().autofmt_xdate()
    plt.tight_layout()

ani = FuncAnimation(plt.gcf(), animate, 5000)
ani
plt.tight_layout()
plt.show()
<Figure size 432x288 with 0 Axes>
In [490]:
ani
Out[490]:
<matplotlib.animation.FuncAnimation at 0x1b5d6d4f850>
In [498]:
import matplotlib.pyplot as plt
import matplotlib.animation
import numpy as np

t = comp_df[2800:3000].High
x = comp_df[2800:3000].Volume

fig, ax = plt.subplots()
ax.axis([0,2*np.pi,-1,1])
l, = ax.plot([],[])

def animate(i):
    l.set_data(t[:i], x[:i])

ani = matplotlib.animation.FuncAnimation(fig, animate, frames=len(t))

from IPython.display import HTML
HTML(ani.to_jshtml())

plt.rcParams["animation.html"] = "jshtml"
In [499]:
ani
Out[499]:
In [ ]:
 
In [477]:
comp_df[:500]["High"].iplot(kind="histogram", bins=170, theme="white", title="Total Volume of Stock",xTitle='For '+comp+' company as per CSV', yTitle='High value')
In [ ]:
 

Analysis 4 :

In [350]:
#we will check correlation between features in dataset.
my_title = 'Correlation Matrix for the Feature/Coloumns present in '+comp+ 'Using Horizontal bar type'
comp_df.corr().iplot(kind="bar", title=my_title)
In [387]:
comp_df.corr().iplot(kind='heatmap',
                     colorscale='Blues',
                     title="Feature Correlation Matrixpresent in "+comp)
In [ ]:
 
In [ ]:
 
In [ ]:
 
In [ ]:
 
In [ ]:
 

Analysis 5 :

In [246]:
df_genre = comp_df[100:2000:85].groupby('Date')

def genreBased(comp_df_feature):
    xrange = np.arange(1,len(df_genre.sum())+1)
    fig,ax= plt.subplots(ncols=2,figsize=(18,6))
    df_to_plot = df_genre.sum().sort_values(by=comp_df_feature,ascending =False)[::-1]
    df_to_plot[comp_df_feature].plot(kind='barh')
    plt.title(comp_df_feature)
    
    #labels
    ax[1].set_ylabel(None)
    ax[1].tick_params(axis='both', which='major', labelsize=15)
    ax[1].set_xlabel('', fontsize=15,labelpad=21)
    
    #spines
    ax[1].spines['top'].set_visible(False)
    ax[1].spines['right'].set_visible(False)
    ax[1].grid(False)
    
    #annotations    
    for x,y in zip(np.arange(len(df_genre.sum())+1),df_genre.sum().sort_values(by=comp_df_feature,ascending =False)[::-1][comp_df_feature]):
        label = "{:}".format(y)
        labelr = round(y,2)
        plt.annotate(labelr, # this is the text
                     (y,x), # this is the point to label
                      textcoords="offset points",# how to position the text
                     xytext=(6,0), # distance from text to points (x,y)
                    ha='left',va="center")
     
    #donut chart
    theme = plt.get_cmap('Blues')
    ax[0].set_prop_cycle("color", [theme(1. * i / len(df_to_plot))for i in range(len(df_to_plot))])    
    wedges, texts,_ = ax[0].pie(df_to_plot[comp_df_feature], wedgeprops=dict(width=0.45), startangle=-45,labels=df_to_plot.index,
                      autopct="%.1f%%",textprops={'fontsize': 13,})

 
    plt.tight_layout()    
In [248]:
genreBased('Open') #ABOVE
print()
genreBased('Close') #BELOW

Analysis 6 :

In [67]:
sns.boxplot(x="Volume", y="Symbol",  data=comp_df[1000:1500])
Out[67]:
<matplotlib.axes._subplots.AxesSubplot at 0x19f73afb490>
In [ ]:
 
In [421]:
#Displot Method
df_name = ['High', 'Low']
j = 0
df_lst = [comp_df.High,
          comp_df.Low]

for i in df_lst:
    plt.figure(figsize=(16,5))
    sns.distplot(i)
    plt.title('Displot Method for '+str(df_name[j])+' Stock', fontdict={'fontsize':24})
    j += 1
In [ ]:
 

Analysis 7 :

In [182]:
df = comp_df
fig = px.scatter_3d(df, x=comp_df[500:1000:13].Open,
                    y=comp_df[500:1000:13].Close,
                    z=comp_df[500:1000:13].High,
                    color=comp_df[500:1000:13].Date)
fig.
fig.show()
In [284]:
color_set = ['aggrnyl','agsunset','blackbody','bluered','blues','blugrn','bluyl','brwnyl','bugn','bupu','burg','burgyl','cividis','darkmint','electric','emrld','gnbu','greens','greys','hot','inferno','jet','magenta','magma','mint','orrd','oranges','oryel','peach','pinkyl','plasma','plotly3','pubu','pubugn','purd','purp','purples','purpor','rainbow','rdbu','rdpu','redor','reds','sunset','sunsetdark','teal','tealgrn','turbo','viridis','ylgn','ylgnbu','ylorbr','ylorrd','algae','amp','deep','dense','gray','haline','ice','matter','solar','speed','tempo','thermal','turbid','armyrose','brbg','earth','fall','geyser','prgn','piyg','picnic','portland','puor','rdgy','rdylbu','rdylgn','spectral','tealrose','temps','tropic','balance','curl','delta','oxy','edge','hsv','icefire','phase','twilight','mrybm','mygbm']

my_col = random.choice(color_set)
t = np.linspace(0, 4000, 120)
x, y, z = pd.Series(comp_df[:600].Volume), pd.Series(comp_df[:600].Open), t

fig = go.Figure(data=[go.Scatter3d(
    x=x,
    y=y,
    z=z,   
    mode='markers',
    marker=dict(
        size=7,
        color=z,               # set color to an array/list of desired values
        colorscale=my_col,   # choose a colorscale
        opacity=0.8
    )
)])

# tight layout
fig.update_layout(margin=dict(l=0, r=0, b=0, t=0))
fig.show()
In [354]:
z_data = comp_df[:500]
z = z_data.values
sh_0, sh_1 = z.shape
x, y = np.linspace(0, 10, sh_0), np.linspace(0, 10, sh_1)
fig = go.Figure(data=[go.Surface(x=x, y=y, z=z)])
fig.update_layout(title='Mt Bruno Elevation', autosize=False,
                  width=650, height=650,
                  margin=dict(l=6, r=5, b=0, t=0))
fig.show()

Finally we have come to the end of EDA, and we have gathered few great Patterns on the Data on specific Company of Nifty50, But this cannot be the end, we can gather still more Patterns or insights.
Now moving further to the Next Step in our Project.

Stock Price Prediction with Sklearn

In [467]:
from sklearn.model_selection import train_test_split
from datetime import datetime, date
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Lasso
from sklearn.linear_model import ElasticNet
from sklearn.tree import DecisionTreeRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from matplotlib import pyplot as plt
from sklearn.metrics import mean_squared_error
In [468]:
# start date should be within 5 years of current date according to iex API we have used
# The more data we have, the better results we get!

start = datetime(2016, 1, 1)
end = date.today()
df = comp_df[1:]
In [469]:
prices = df[df.columns[0:1]]
prices.reset_index(level=0, inplace=True)
prices["timestamp"] = pd.to_datetime(prices.Date).values.astype(float)

#prices['timestamp'] = ADBE['Date'].values.astype(float)
#ols1 = pd.ols(y=ADBE['Close'], x=ADBE['Date'], intercept=True)

prices = prices.drop(['Date'], axis=1)
prices

dataset = prices.values
X = dataset[:,1].reshape(-1,1)
Y = dataset[:,0:1]

validation_size = 0.15
seed = 7

X_train, X_validation, Y_train, Y_validation = train_test_split(X, Y, test_size=validation_size, random_state=seed)
In [470]:
# Test options and evaluation metric
num_folds = 10
seed = 7
scoring = "r2"

# Spot-Check Algorithms
models = []
models.append((' LR ', LinearRegression()))
models.append((' LASSO ', Lasso()))
models.append((' EN ', ElasticNet()))
models.append((' KNN ', KNeighborsRegressor()))
models.append((' CART ', DecisionTreeRegressor()))
models.append((' SVR ', SVR()))
In [471]:
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score

# evaluate each model in turn
results = []
names = []
for name, model in models:
    kfold = KFold(n_splits=num_folds, random_state=seed, shuffle=True)
    cv_results = cross_val_score(model, X_train, Y_train, cv=kfold, scoring=scoring)
    # print(cv_results)
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)
 LR : 0.999987 (0.000001)
 LASSO : 0.999987 (0.000001)
 EN : 0.999987 (0.000001)
 KNN : 1.000000 (0.000000)
 CART : 0.999999 (0.000000)
 SVR : 0.806959 (0.008756)
In [ ]:
 
In [473]:
# Future prediction, add dates here for which you want to predict
dates = ["2021-12-23", "2022-12-24", "2023-12-25", "2024-12-26", "2025-12-27",]
#convert to time stamp
for dt in dates:
  datetime_object = datetime.strptime(dt, "%Y-%m-%d")
  timestamp = datetime.timestamp(datetime_object)
  # to array X
  np.append(X, int(timestamp))


# Define model
model = DecisionTreeRegressor()
# Fit to model
model.fit(X_train, Y_train)
# predict
predictions = model.predict(Y)
print(mean_squared_error(Y, predictions))

# %matplotlib inline 
fig= plt.figure(figsize=(12,5))
plt.plot(X,Y)
plt.show()
9378356.0
In [ ]:
 
In [ ]:
 
In [ ]:
 
In [ ]:
 
In [ ]:
 

Asking and Answering Questions

Standard questions for financial investors.
Here we will be using the Metadata stock csv and entire Nifty50 companies csv, to answer these questions, since using One company it is good to Plot, but coming to answer Queries, going with these large csv will be good to go.

The Few questions have been taken from the data, so as to solve these using Accordingly

  • How to predict the overall Ranking list of Industry and Company by taking mean over Volume and Trades?
  • Create a dataframe according to Ranking on the basis of trading stock Since Inception (using Trades column) and sort according to maximum Trade occured
  • Which are the Top 5 Best and Worst stocks SI, implement using delta price percentage (Since Inception/Beginning)?
  • How to get the Maximum Close stock as per Company by implementing Standard Deviation?
  • How to get the Percentages of all Industries involved in Nifty50 using Pie Plot?

Q1: How to predict the overall Ranking list of Industry and Company by taking mean over Volume and Trades?

In [360]:
# Ranking list over Industry
quest_1a = nse_india_df.groupby('Industry')[['Volume', 'Trades']].mean()
quest_1a.sort_values(by=['Volume'], ascending=False)
Out[360]:
Volume Trades
Industry
METALS 7.027916e+06 64367.712454
TELECOM 6.424966e+06 85133.370114
MEDIA & ENTERTAINMENT 4.902652e+06 62645.709283
ENERGY 4.380083e+06 61415.426303
FINANCIAL SERVICES 4.092180e+06 88914.661690
SERVICES 3.909287e+06 47114.489778
FERTILISERS & PESTICIDES 2.868241e+06 51996.825081
AUTOMOBILE 2.823390e+06 54213.138957
IT 2.127041e+06 72581.542822
CONSUMER GOODS 2.007288e+06 44949.679492
CONSTRUCTION 1.917127e+06 85699.350977
PHARMA 1.406101e+06 53481.115499
CEMENT & CEMENT PRODUCTS 2.565819e+05 17924.217427
In [361]:
# Ranking list over Company name
quest_1b = nse_india_df.groupby('Company Name')[['Volume', 'Trades']].mean().head(5)
quest_1b.sort_values(by=['Volume'], ascending=False)
Out[361]:
Volume Trades
Company Name
Axis Bank Ltd. 6.916522e+06 120602.231678
Adani Ports and Special Economic Zone Ltd. 3.909287e+06 47114.489778
Bajaj Finance Ltd. 1.201285e+06 56423.101384
Asian Paints Ltd. 5.096721e+05 41346.011401
Bajaj Auto Ltd. 4.114639e+05 27487.876629

Q2: Create a dataframe according to Ranking on the basis of trading stock Since Inception (using Trades column) and sort according to maximum Trade occured

In [362]:
quest_2 = nse_india_df.groupby(['Company Name']).agg({'Trades': ['sum', 'mean']})
In [363]:
quest_2.columns = ['Trades_sum','Trades_mean']
In [364]:
quest_2.sort_values(by=['Trades_sum'], ascending=False).head(10)
Out[364]:
Trades_sum Trades_mean
Company Name
State Bank of India 370125997.0 150702.767508
Reliance Industries Ltd. 356686454.0 145230.640879
ICICI Bank Ltd. 339830893.0 138367.627443
Tata Motors Ltd. 315448583.0 128439.976792
Axis Bank Ltd. 296199081.0 120602.231678
Infosys Ltd. 273726787.0 112367.318144
Housing Development Finance Corporation Ltd. 250902630.0 102159.051303
ITC Ltd. 232248589.0 94563.757736
Tata Steel Ltd. 230788524.0 93969.268730
Larsen & Toubro Ltd. 210477606.0 85699.350977

Q3: Which are the Top 5 Best and Worst stocks SI, implement using delta price percentage (Since Inception/Beginning)?

In [372]:
quest_3_a = nse_india_df.groupby('Company Name')[['Company Name','Date', 'Close']].head(1)
quest_3_b = nse_india_df.groupby('Company Name')[['Company Name','Date', 'Close']].tail(1)
In [373]:
quest_3 = quest_3_a.merge(q_3_b, on="Company Name")
quest_3["Price_%_S-I"] = ((quest_3["Close_y"] - quest_3["Close_x"])/q_3["Close_x"])*100
In [369]:
# Top 5 Best Companies as per Price Percentage since its beginning
quest_3.sort_values(by=['Price_%_S-I'], ascending=False).head(5)
Out[369]:
Company Name Date_x Close_x Date_y Close_y Price_%_S-I
37 Shree Cement Ltd. 2000-01-03 64.75 2021-04-30 27910.50 43005.019305
12 Eicher Motors Ltd. 2000-01-03 48.85 2021-04-30 2421.65 4857.318321
30 Maruti Suzuki India Ltd. 2003-07-09 164.30 2021-04-30 6455.65 3829.184419
22 IndusInd Bank Ltd. 2000-01-03 28.95 2021-04-30 934.95 3129.533679
44 UltraTech Cement Ltd. 2004-08-24 260.20 2021-04-30 6278.95 2313.124520
In [374]:
# Top 5 Worst Companies as per Price Percentage since its beginning
quest_3.sort_values(by=['Price_%_S-I'], ascending=True).head(5)
Out[374]:
Company Name Date_x Close_x Date_y Close_y Price_%_S-I
47 Wipro Ltd. 2000-01-03 2724.20 2021-04-30 492.75 -81.912121
25 ITC Ltd. 2000-01-03 708.50 2021-04-30 202.60 -71.404375
24 Indian Oil Corporation Ltd. 2000-01-03 270.85 2021-04-30 90.85 -66.457449
38 Sun Pharmaceutical Industries Ltd. 2000-01-03 1835.05 2021-04-30 654.45 -64.336122
10 Coal India Ltd. 2010-11-04 342.55 2021-04-30 133.05 -61.158955

Q4: Standard deviation is a statistical term that measures the amount of variability or dispersion around an average. Standard deviation is also a measure of volatility: using Numpy is easy to calculate it for each stock or sector into Index.

  • With SD, we could calculate coefficient of variation (CV=standard deviation / mean).
  • As a rule of thumb, a CV >= 1 indicates a relatively high variation, while a CV < 1 can be considered low.
  • This means that distributions with a coefficient of variation higher than 1 are considered to be high variance whereas those with a CV lower than 1 are considered to be low-variance.

How to get the Maximum Close stock as per Company using SD?

In [381]:
quest_4_a = nse_india_df.groupby(['Company Name'])['Close'].std() / nse_india_df.groupby(['Company Name'])['Close'].mean()
quest_4_a.to_frame().sort_values(by = 'Close', ascending=False)
Out[381]:
Close
Company Name
Eicher Motors Ltd. 1.437204
Shree Cement Ltd. 1.186481
IndusInd Bank Ltd. 1.138362
Titan Company Ltd. 1.106728
Wipro Ltd. 1.052599
Bajaj Finserv Ltd. 1.041638
Maruti Suzuki India Ltd. 0.936998
State Bank of India 0.888053
Asian Paints Ltd. 0.861365
Britannia Industries Ltd. 0.808403
Oil & Natural Gas Corporation Ltd. 0.784296
ITC Ltd. 0.780837
Hindustan Unilever Ltd. 0.751550
Bajaj Finance Ltd. 0.747161
UltraTech Cement Ltd. 0.708839
ICICI Bank Ltd. 0.669198
Grasim Industries Ltd. 0.657108
HCL Technologies Ltd. 0.641064
HDFC Bank Ltd. 0.631049
Dr. Reddy's Laboratories Ltd. 0.598303
JSW Steel Ltd. 0.589857
Tech Mahindra Ltd. 0.578754
Infosys Ltd. 0.565269
Reliance Industries Ltd. 0.564657
Tata Motors Ltd. 0.563037
Nestle India Ltd. 0.556800
Housing Development Finance Corporation Ltd. 0.552485
Kotak Mahindra Bank Ltd. 0.550144
Mahindra & Mahindra Ltd. 0.539391
Cipla Ltd. 0.535048
Sun Pharmaceutical Industries Ltd. 0.483692
Zee Entertainment Enterprises Ltd. 0.448082
GAIL (India) Ltd. 0.443749
Indian Oil Corporation Ltd. 0.434258
Axis Bank Ltd. 0.430713
Tata Consultancy Services Ltd. 0.426743
Bharat Petroleum Corporation Ltd. 0.414824
Adani Ports and Special Economic Zone Ltd. 0.407523
Bharti Airtel Ltd. 0.400638
Vedanta Ltd. 0.396787
Larsen & Toubro Ltd. 0.382986
Hindalco Industries Ltd. 0.355664
Bajaj Auto Ltd. 0.353858
UPL Ltd. 0.346828
Tata Steel Ltd. 0.322052
Power Grid Corporation of India Ltd. 0.296699
Coal India Ltd. 0.257488
NTPC Ltd. 0.237479
Hero MotoCorp Ltd. 0.233037
In [382]:
# Now applying the same operation on All Unique Industries upon High gained Stock
quest_4_b = nse_india_df.groupby(['Industry'])['High'].std() / nse_india_df.groupby(['Industry'])['High'].mean()
quest_4_b.to_frame().sort_values(by = 'High', ascending=False)
Out[382]:
High
Industry
AUTOMOBILE 1.900448
CONSUMER GOODS 1.548981
CEMENT & CEMENT PRODUCTS 1.471718
FINANCIAL SERVICES 1.168164
ENERGY 0.930813
PHARMA 0.852903
IT 0.779119
METALS 0.751518
MEDIA & ENTERTAINMENT 0.443273
SERVICES 0.407076
TELECOM 0.404383
CONSTRUCTION 0.386303
FERTILISERS & PESTICIDES 0.344067

Q5: How to get the Percentages of all Industries involved in Nifty50 using Pie Plot?

In [383]:
#Nifty 50 Index/Industry Composition
quest_3 = stock_metadata_df.Industry.value_counts()
plt.title("Nifty 50 Index composition")
plt.pie(quest_3, labels=quest_3.index, autopct='%1.1f%%', startangle=180, radius=1.72);

Inferences and Conclusion

It was great to analyze the data about stock market from NSE. Since Inception, most stocks have aims a great gain; it's strange to verify that the best performers are not in top 5 trading stocks (but, truly, it's in line with Warren Buffet's guideline).

Analysis could be more interested with Market CAP column, that is missing in this Dataset.

In [ ]:
 

References and Future Work

This analysis could be a starting point for people interesting on financial markets data. This dataset is still missing some core information (like capital markets for each stocks) but this information could maybe found using some others data providers (like Yahoo Finance) and maybe become a useful instruments for trading activity.

Links for tips: https://stackoverflow.com/ https://pandas.pydata.org/